test plan
VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation
Liu, Ethan TS., Wang, Austin, Mateega, Spencer, Georgescu, Carlos, Tang, Danny
Ensuring that large language models (LLMs) can effectively assess, detect, explain, and remediate software vulnerabilities is critical for building robust and secure software systems. We introduce VADER, a human-evaluated benchmark designed explicitly to assess LLM performance across four key vulnerability-handling dimensions: assessment, detection, explanation, and remediation. VADER comprises 174 real-world software vulnerabilities, each carefully curated from GitHub repositories and annotated by security experts. For each vulnerability case, models are tasked with identifying the flaw, classifying it using Common Weakness Enumeration (CWE), explaining its underlying cause, proposing a patch, and formulating a test plan. Using a one-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER, and human security experts evaluate each response according to a rigorous, standardized scoring rubric emphasizing remediation (quality of the code fix, 50%), explanation (20%), and classification and test plan (30%). Our results show that current state-of-the-art LLMs achieve only moderate success on VADER: OpenAI's o3 attained 54.7% accuracy overall, with others in the 49-54% range, indicating ample room for improvement. Notably, remediation quality is strongly correlated (Pearson r > 0.97) with accurate classification and test plans, suggesting that models that effectively categorize vulnerabilities also tend to fix them well. VADER's comprehensive dataset, detailed evaluation rubrics, scoring tools, and visualized results with confidence intervals are publicly released, providing the community with an interpretable, reproducible benchmark to advance vulnerability-aware LLMs. All code and data are available at: https://github.com/AfterQuery/vader
- North America > United States > Pennsylvania (0.04)
- North America > United States > Maryland > Montgomery County > Gaithersburg (0.04)
- Europe > United Kingdom > North Sea > Southern North Sea (0.04)
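The rubric weights quoted in the VADER abstract (remediation 50%, explanation 20%, classification and test plan 30%) can be illustrated with a minimal sketch. The component names, the assumption that each component is scored on a 0 to 1 scale, and the function `weighted_score` are hypothetical; this is not the scoring tool released with the benchmark.

```python
# Hypothetical illustration of a weighted rubric; only the weights
# (50% / 20% / 30%) come from the abstract.
WEIGHTS = {
    "remediation": 0.5,
    "explanation": 0.2,
    "classification_and_test_plan": 0.3,
}


def weighted_score(component_scores):
    """Combine per-component scores (assumed to lie in [0, 1]) into one total."""
    missing = set(WEIGHTS) - set(component_scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


# Example: a perfect fix, a half-credit explanation, and no credit for
# classification and test plan.
example = {
    "remediation": 1.0,
    "explanation": 0.5,
    "classification_and_test_plan": 0.0,
}
total = weighted_score(example)
```

Because remediation carries half the weight, a response with a perfect patch and nothing else would still score 0.5 under this assumed scheme.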
Planning Reliability Assurance Tests for Autonomous Vehicles
Zheng, Simin, Lu, Lu, Hong, Yili, Liu, Jian
Artificial intelligence (AI) technology has become increasingly prevalent and is transforming our everyday life. One important application of AI technology is the development of autonomous vehicles (AVs). However, the reliability of an AV needs to be carefully demonstrated via an assurance test so that the product can be used with confidence in the field. To plan an assurance test, one needs to determine how many AVs need to be tested, for how many miles, and the standard for passing the test. Existing research has made great efforts in developing reliability demonstration tests for product development and assessment in other application fields. However, statistical methods have not been utilized in AV test planning. This paper aims to fill this gap by developing statistical methods for planning AV reliability assurance tests based on recurrent events data. We explore the relationship between multiple criteria of interest in the context of planning AV reliability assurance tests. Specifically, we develop two test planning strategies based on homogeneous and non-homogeneous Poisson processes while balancing multiple objectives with the Pareto front approach. We also offer recommendations for practical use. The disengagement events data from the California Department of Motor Vehicles AV testing program are used to illustrate the proposed assurance test planning methods.
- North America > United States > California (0.34)
- North America > United States > Florida > Hillsborough County > Tampa (0.14)
- North America > United States > Arizona > Pima County > Tucson (0.14)
- Transportation > Ground > Road (1.00)
- Automobiles & Trucks (1.00)
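The test-planning question raised in the abstract above (how many miles to test and what passing standard to use) can be sketched for the simplest case, a homogeneous Poisson process. The sketch below is a textbook-style demonstration test, not the paper's method: it ignores the non-homogeneous case and the Pareto front trade-off. The function names, the parameter `rate_max` (the highest acceptable event rate per mile), and `beta` (the probability of passing a fleet at that rate) are assumptions introduced for illustration.

```python
import math


def poisson_cdf(c, mu):
    """P(N <= c) for N ~ Poisson(mu), summed term by term."""
    term = math.exp(-mu)
    total = term
    for k in range(1, c + 1):
        term *= mu / k
        total += term
    return total


def required_exposure(c, rate_max, beta):
    """Smallest total test mileage such that a fleet whose event rate is
    rate_max (events per mile) passes with probability at most beta,
    where passing means observing at most c events.

    Assumes events follow a homogeneous Poisson process, so the count
    over m total miles is Poisson(rate_max * m).
    """
    # Find the Poisson mean mu at which P(N <= c) drops to beta.
    lo, hi = 0.0, 1.0
    while poisson_cdf(c, hi) > beta:
        hi *= 2.0
    for _ in range(200):  # bisection; the CDF is decreasing in mu
        mid = (lo + hi) / 2.0
        if poisson_cdf(c, mid) > beta:
            lo = mid
        else:
            hi = mid
    return hi / rate_max


# Hypothetical inputs: at most 1 event per 10,000 miles, 10% passing risk.
miles_zero_events = required_exposure(c=0, rate_max=1e-4, beta=0.10)
miles_one_event = required_exposure(c=1, rate_max=1e-4, beta=0.10)
```

For a zero-event standard the result matches the closed form -ln(beta) / rate_max; allowing more events before failing requires more total mileage to give the same assurance. Under the homogeneous assumption only total mileage matters, so it can be split across any number of vehicles.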
How to approach conversation design with Amazon Lex: Building and testing (Part 3)
In parts one and two of our guide to conversation design with Amazon Lex, we discussed how to gather requirements for your conversational AI application and draft conversational flows. In this post, we help you bring all the pieces together. You'll learn how to draft an interaction model to deliver natural conversational experiences, and how to test and tune your application. In the second post of this series, you identified some use cases that you wanted to automate and wrote sample interactions between a user and your application. In this post, we use these use cases to build an Amazon Lex framework, called an interaction model, but first, let's review some important definitions.
- Banking & Finance (0.49)
- Retail > Online (0.40)
Test and Evaluation Framework for Multi-Agent Systems of Autonomous Intelligent Agents
Lanus, Erin, Hernandez, Ivan, Dachowicz, Adam, Freeman, Laura, Grande, Melanie, Lang, Andrew, Panchal, Jitesh H., Patrick, Anthony, Welch, Scott
Test and evaluation is a necessary process for ensuring that engineered systems perform as intended under a variety of conditions, both expected and unexpected. In this work, we consider the unique challenges of developing a unifying test and evaluation framework for complex ensembles of cyber-physical systems with embedded artificial intelligence. We propose a framework that incorporates test and evaluation not only throughout the development life cycle but also into operation, as the system learns and adapts in a noisy, changing, and contended environment. The framework accounts for the challenges of testing the integration of diverse systems at various hierarchical scales of composition while respecting that testing time and resources are limited. A generic use case is provided for illustrative purposes, and research directions that emerge from exploring the use case via the framework are suggested.
- North America > United States > Virginia > Fairfax County > Fairfax (0.04)
- North America > United States > Virginia > Arlington County > Arlington (0.04)
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
- Government > Regional Government > North America Government > United States Government (1.00)
- Government > Military (0.68)
Testing Features of ML Models - DZone AI
In this post, you will learn about different types of test cases that you could come up with for testing features of Data Science/Machine Learning models. Testing features is one of the key sets of tests that needs to be performed to ensure the high performance of Machine Learning models in a consistent and sustained manner. Features are the most important part of a Machine Learning model. Features are nothing but the predictor variables, which are used to predict the outcome or response variable. Simply speaking, the following function represents y as the outcome variable and x1, x2, and x1x2 as predictor variables.